5. Dimensionality reduction techniques

Analysis

Before we get started with PCA, I will briefly describe the dataset: it has 8 numeric variables and 155 observations, covering human development and gender inequality in 155 countries.

Let’s take a look at the data:

source("create_human_part2.R")
library(GGally)
library(dplyr)  # needed for glimpse() and select()

# GNI was read in as character with thousands separators; strip the commas and convert to numeric
human$GNI <- gsub(",", "", human$GNI)
human$GNI <- as.numeric(human$GNI)

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : num  64992 42261 56431 44025 45435 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
glimpse(human)
## Observations: 155
## Variables: 8
## $ Edu2.FM   <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.969060...
## $ Labo.FM   <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.828611...
## $ Edu.Exp   <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9...
## $ Life.Exp  <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, 82.0...
## $ GNI       <dbl> 64992, 42261, 56431, 44025, 45435, 43919, 39568, 529...
## $ Mat.Mor   <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, 2, 1...
## $ Ado.Birth <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, 25.3...
## $ Parli.F   <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, 28.2...
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

As we can see, the variables are on very different scales. The variables used in the analysis are the following:

  1. Edu2.FM – ratio of females to males with at least secondary education.
  2. Labo.FM – ratio of female to male labour force participation rates.
  3. Edu.Exp – expected years of schooling.
  4. Life.Exp – life expectancy at birth.
  5. GNI – gross national income per capita.
  6. Mat.Mor – maternal mortality ratio.
  7. Ado.Birth – adolescent birth rate.
  8. Parli.F – percentage of female representatives in parliament.

The mean of the female-to-male labour force participation ratio (Labo.FM) is 0.7074, which means that fewer women than men take part in the labour force.

Let’s take a deeper look at the connections between variables.

# custom ggpairs from https://github.com/ggobi/ggally/issues/139
library(RColorBrewer)
corColors <- RColorBrewer::brewer.pal(n = 7, name = "RdYlGn")[2:6]
corColors
## [1] "#FC8D59" "#FEE08B" "#FFFFBF" "#D9EF8B" "#91CF60"
my_custom_cor_color <- function(data, mapping, color = I("black"), sizeRange = c(1, 5), ...) {
  
  # get the x and y data to use the other code
  x <- eval(mapping$x, data)
  y <- eval(mapping$y, data)
  
  ct <- cor.test(x,y)

  r <- unname(ct$estimate)
  rt <- format(r, digits=2)[1]
  tt <- as.character(rt)
  
  # plot the cor value
  p <- ggally_text(
    label = tt, 
    mapping = aes(),
    xP = 0.5, yP = 0.5, 
    size = 6,
    color=color,
    ...
  ) +
    
    theme(
      panel.background=element_rect(fill="white", color = "black"),
      panel.grid.minor=element_blank(),
      panel.grid.major=element_blank()
    ) 
  
  corColors <- RColorBrewer::brewer.pal(n = 7, name = "RdYlGn")[2:6]
  
  if (r <= -0.8) {
    corCol <- corColors[1]
  } else if (r <= -0.6) {
    corCol <- corColors[2]
  } else if (r < 0.6) {
    corCol <- corColors[3]
  } else if (r < 0.8) {
    corCol <- corColors[4]
  } else {
    corCol <- corColors[5]
  }
  p <- p + theme(
    panel.background = element_rect(fill= corCol)
  )
  p
}
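As a side note, the if/else ladder above can be collapsed into a single `cut()` call. This is just a sketch of an equivalent helper (`cor_to_color` is a made-up name), with the caveat that values lying exactly on a negative boundary may fall into the neighbouring bin:

```r
# Map a correlation r to one of five colors with cut() instead of if/else.
# Breaks follow the ladder above: below -0.8, -0.8..-0.6, -0.6..0.6, 0.6..0.8, above 0.8
cor_to_color <- function(r, palette) {
  bin <- cut(r, breaks = c(-1, -0.8, -0.6, 0.6, 0.8, 1),
             labels = FALSE, right = FALSE, include.lowest = TRUE)
  palette[bin]
}

cor_to_color(-0.9, letters[1:5])  # strongest-negative bin -> "a"
cor_to_color(0.7,  letters[1:5])  # moderate-positive bin  -> "d"
```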


ggpairs(
 human,
  columns = 1:8,
  upper = list(continuous = my_custom_cor_color),
  diag = list(combo = "facethist"),
  lower = list(
    combo = wrap("facethist", bins = 100)
  )
)

We can see that the strongest negative correlation is between life expectancy and maternal mortality (the higher the one, the lower the other). At the other end of the spectrum, expected years of education and life expectancy are the most strongly positively correlated pair (0.79), while the correlation between GNI and life expectancy is only 0.69.

Let’s do a few histograms and then move to PCA.

library(reshape2)
d <- melt(human)
ggplot(d,aes(x = value)) + 
  facet_wrap(~variable,scales = "free_x") + 
  geom_histogram()

First we perform PCA without scaling the variables. We should expect poor results, since the variables take values on very different scales and the components will be dominated by the variables with the largest variance.

nonsta_human_pca <- prcomp(human)
s <- summary(nonsta_human_pca)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")


biplot(nonsta_human_pca, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

As you can see, PCA went badly wrong without scaling: the first principal component (PC1) captures virtually 100% of the total variance, because it is dominated by GNI, the variable with by far the largest variance in absolute terms.
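This scale-dominance effect is easy to demonstrate on a toy dataset (a sketch with made-up variables, not part of the analysis): give one variable a much larger standard deviation, and PC1 aligns almost entirely with it.

```r
set.seed(1)
toy <- data.frame(
  small = rnorm(100, sd = 1),     # unit-scale variable
  big   = rnorm(100, sd = 1000)   # GNI-like, huge variance
)
p <- prcomp(toy)

# proportion of total variance captured by PC1 -- essentially all of it
pc1_share <- p$sdev[1]^2 / sum(p$sdev^2)
pc1_share > 0.99
```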

Now let’s do the same, but first we scale the variables.

# Now we scale the variables
human_std <- scale(human)
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# Perform pca
pca_human <- prcomp(human_std)

s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")



biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

The results are quite different. Ado.Birth and Mat.Mor correlate positively with PC1, while Life.Exp, Edu2.FM, Edu.Exp and GNI correlate negatively with it; the first two are thus negatively correlated with the other four. We can see Mat.Mor and Ado.Birth pointing in the direction of Sierra Leone, Chad and other developing countries.

Parli.F and Labo.FM contribute almost nothing to PC1, which now accounts for 53.6% of the total variance. Instead, they correlate positively with each other and with PC2.
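As a side note on the scaling step, `prcomp()` can also standardize the data internally via `scale. = TRUE`, which is equivalent to calling `scale()` first. A quick sketch using the built-in `mtcars` data (any numeric data frame would do):

```r
pca_a <- prcomp(scale(mtcars))
pca_b <- prcomp(mtcars, scale. = TRUE)

# the component standard deviations agree exactly
all.equal(pca_a$sdev, pca_b$sdev)
```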

library(FactoMineR)
data(tea)

glimpse(tea)
## Observations: 300
## Variables: 36
## $ breakfast        <fctr> breakfast, breakfast, Not.breakfast, Not.bre...
## $ tea.time         <fctr> Not.tea time, Not.tea time, tea time, Not.te...
## $ evening          <fctr> Not.evening, Not.evening, evening, Not.eveni...
## $ lunch            <fctr> Not.lunch, Not.lunch, Not.lunch, Not.lunch, ...
## $ dinner           <fctr> Not.dinner, Not.dinner, dinner, dinner, Not....
## $ always           <fctr> Not.always, Not.always, Not.always, Not.alwa...
## $ home             <fctr> home, home, home, home, home, home, home, ho...
## $ work             <fctr> Not.work, Not.work, work, Not.work, Not.work...
## $ tearoom          <fctr> Not.tearoom, Not.tearoom, Not.tearoom, Not.t...
## $ friends          <fctr> Not.friends, Not.friends, friends, Not.frien...
## $ resto            <fctr> Not.resto, Not.resto, resto, Not.resto, Not....
## $ pub              <fctr> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub,...
## $ Tea              <fctr> black, black, Earl Grey, Earl Grey, Earl Gre...
## $ How              <fctr> alone, milk, alone, alone, alone, alone, alo...
## $ sugar            <fctr> sugar, No.sugar, No.sugar, sugar, No.sugar, ...
## $ how              <fctr> tea bag, tea bag, tea bag, tea bag, tea bag,...
## $ where            <fctr> chain store, chain store, chain store, chain...
## $ price            <fctr> p_unknown, p_variable, p_variable, p_variabl...
## $ age              <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 3...
## $ sex              <fctr> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M,...
## $ SPC              <fctr> middle, middle, other worker, student, emplo...
## $ Sport            <fctr> sportsman, sportsman, sportsman, Not.sportsm...
## $ age_Q            <fctr> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35...
## $ frequency        <fctr> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, ...
## $ escape.exoticism <fctr> Not.escape-exoticism, escape-exoticism, Not....
## $ spirituality     <fctr> Not.spirituality, Not.spirituality, Not.spir...
## $ healthy          <fctr> healthy, healthy, healthy, healthy, Not.heal...
## $ diuretic         <fctr> Not.diuretic, diuretic, diuretic, Not.diuret...
## $ friendliness     <fctr> Not.friendliness, Not.friendliness, friendli...
## $ iron.absorption  <fctr> Not.iron absorption, Not.iron absorption, No...
## $ feminine         <fctr> Not.feminine, Not.feminine, Not.feminine, No...
## $ sophisticated    <fctr> Not.sophisticated, Not.sophisticated, Not.so...
## $ slimming         <fctr> No.slimming, No.slimming, No.slimming, No.sl...
## $ exciting         <fctr> No.exciting, exciting, No.exciting, No.excit...
## $ relaxing         <fctr> No.relaxing, No.relaxing, relaxing, relaxing...
## $ effect.on.health <fctr> No.effect on health, No.effect on health, No...
dim(tea)
## [1] 300  36
# plotting factor data
ggplot(data = tea, aes(x = Tea, y = effect.on.health, colour = sugar)) +
  geom_point() + geom_jitter()

# column names to keep in the dataset
keep_columns <- c("Tea", "age_Q", "how", "spirituality", "feminine", "home")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
##         Tea        age_Q                    how     
##  black    : 74   15-24:92   tea bag           :170  
##  Earl Grey:193   25-34:69   tea bag+unpackaged: 94  
##  green    : 33   35-44:40   unpackaged        : 36  
##                  45-59:61                           
##                  +60  :38                           
##            spirituality         feminine         home    
##  Not.spirituality:206   feminine    :129   home    :291  
##  spirituality    : 94   Not.feminine:171   Not.home:  9  
##                                                          
##                                                          
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea         : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ age_Q       : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ how         : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ spirituality: Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ feminine    : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ home        : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
# visualize the dataset
library(tidyr)
gather(tea_time) %>% ggplot(aes(value, fill=key)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Let’s perform a multiple correspondence analysis (MCA) with the chosen variables of the tea data.

mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.256   0.214   0.196   0.182   0.172   0.161
## % of var.             13.944  11.649  10.714   9.939   9.362   8.762
## Cumulative % of var.  13.944  25.592  36.307  46.246  55.608  64.369
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.153   0.142   0.138   0.122   0.098
## % of var.              8.364   7.729   7.543   6.655   5.339
## Cumulative % of var.  72.733  80.463  88.006  94.661 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  |  0.618  0.498  0.198 | -0.191  0.057  0.019 | -0.745
## 2                  |  0.597  0.465  0.238 | -0.090  0.013  0.005 |  0.015
## 3                  |  0.043  0.002  0.002 | -0.013  0.000  0.000 |  0.327
## 4                  | -0.736  0.706  0.496 | -0.199  0.062  0.036 | -0.121
## 5                  | -0.264  0.091  0.051 | -0.043  0.003  0.001 |  0.261
## 6                  | -0.429  0.240  0.229 | -0.170  0.045  0.036 | -0.055
## 7                  |  0.064  0.005  0.003 | -0.115  0.020  0.009 | -0.433
## 8                  |  0.602  0.473  0.179 | -0.583  0.531  0.168 | -0.497
## 9                  |  0.003  0.000  0.000 |  0.308  0.148  0.054 | -0.791
## 10                 |  0.558  0.405  0.144 |  0.231  0.083  0.025 | -1.103
##                       ctr   cos2  
## 1                   0.942  0.288 |
## 2                   0.000  0.000 |
## 3                   0.181  0.099 |
## 4                   0.025  0.013 |
## 5                   0.115  0.050 |
## 6                   0.005  0.004 |
## 7                   0.318  0.124 |
## 8                   0.419  0.122 |
## 9                   1.062  0.358 |
## 10                  2.065  0.562 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   1.136  20.759   0.423  11.242 |  -0.186   0.664
## Earl Grey          |  -0.545  12.474   0.536 -12.665 |   0.027   0.038
## green              |   0.642   2.953   0.051   3.901 |   0.256   0.563
## 15-24              |  -0.965  18.610   0.412 -11.095 |  -0.408   3.975
## 25-34              |  -0.091   0.125   0.002  -0.864 |   1.187  25.275
## 35-44              |   0.530   2.440   0.043   3.593 |  -0.254   0.673
## 45-59              |   0.466   2.884   0.056   4.075 |   0.027   0.012
## +60                |   1.195  11.801   0.207   7.872 |  -0.944   8.818
## tea bag            |  -0.112   0.466   0.016  -2.221 |  -0.523  12.089
## tea bag+unpackaged |  -0.296   1.795   0.040  -3.462 |   0.649  10.286
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.011  -1.838 |  -0.670   9.399   0.147  -6.631 |
## Earl Grey            0.001   0.637 |   0.159   1.379   0.046   3.691 |
## green                0.008   1.557 |   0.573   3.067   0.041   3.485 |
## 15-24                0.073  -4.686 |  -0.264   1.820   0.031  -3.042 |
## 25-34                0.421  11.214 |   0.253   1.253   0.019   2.395 |
## 35-44                0.010  -1.725 |  -1.270  18.256   0.248  -8.616 |
## 45-59                0.000   0.240 |   0.751   9.729   0.144   6.560 |
## +60                  0.129  -6.220 |   0.312   1.046   0.014   2.054 |
## tea bag              0.357 -10.338 |   0.287   3.972   0.108   5.683 |
## tea bag+unpackaged   0.192   7.576 |  -0.665  11.747   0.202  -7.764 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.555 0.016 0.163 |
## age_Q              | 0.550 0.497 0.378 |
## how                | 0.239 0.359 0.202 |
## spirituality       | 0.187 0.001 0.007 |
## feminine           | 0.001 0.289 0.107 |
## home               | 0.003 0.119 0.321 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

From the eigenvalues we can see that the first two dimensions together explain only about 25.6% of the variance, so the biplot should be interpreted with caution. The eta2 table shows that Tea and age_Q are the variables most strongly associated with the first dimension, while age_Q and how dominate the second.


4. Clustering and classification

Analysis

In this week’s exercise we will look at clustering and classification with the Boston dataset from the MASS library. Let’s begin by importing the data.

initsetup <- function(k){
  setwd(k)
  library(Matrix)
  library(ggplot2)
  library(dplyr)
  library(MASS)
  library(GGally)
}
initsetup("/Users/veikko/Desktop/datakurssi/IODS-project/data")
getwd()
## [1] "/Users/veikko/Desktop/datakurssi/IODS-project/data"
rm(list=ls())

data("Boston")

# summaries of the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Here is the list of variables in the dataset:

  1. crim – per capita crime rate by town.
  2. zn – proportion of residential land zoned for lots over 25,000 sq.ft.
  3. indus – proportion of non-retail business acres per town.
  4. chas – Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  5. nox – nitrogen oxides concentration (parts per 10 million).
  6. rm – average number of rooms per dwelling.
  7. age – proportion of owner-occupied units built prior to 1940.
  8. dis – weighted mean of distances to five Boston employment centres.
  9. rad – index of accessibility to radial highways.
  10. tax – full-value property-tax rate per $10,000.
  11. ptratio – pupil-teacher ratio by town.
  12. black – 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  13. lstat – lower status of the population (percent).
  14. medv – median value of owner-occupied homes in $1000s.

The summary and str functions revealed that we have one categorical variable (chas) and one discrete variable (rad). black, crim and zn show very large differences between their minimum and maximum values, but from the 1st to the 3rd quartile their values look believable. One can also note that all variables are on different scales.

Let’s explore the distributions and correlations of different variables in the data set by using ggpairs function.

# there are at least two categorical variables in the data
B1 <- Boston
B1$rad <-  as.factor(Boston$rad)
B1$chas <- as.factor(ifelse(Boston$chas==1, "river", "land"))

# custom ggpairs from https://github.com/ggobi/ggally/issues/139
library(RColorBrewer)
corColors <- RColorBrewer::brewer.pal(n = 7, name = "RdYlGn")[2:6]
corColors
## [1] "#FC8D59" "#FEE08B" "#FFFFBF" "#D9EF8B" "#91CF60"
my_custom_cor_color <- function(data, mapping, color = I("black"), sizeRange = c(1, 5), ...) {
  
  # get the x and y data to use the other code
  x <- eval(mapping$x, data)
  y <- eval(mapping$y, data)
  
  ct <- cor.test(x,y)

  r <- unname(ct$estimate)
  rt <- format(r, digits=2)[1]
  tt <- as.character(rt)
  
  # plot the cor value
  p <- ggally_text(
    label = tt, 
    mapping = aes(),
    xP = 0.5, yP = 0.5, 
    size = 6,
    color=color,
    ...
  ) +
    
    theme(
      panel.background=element_rect(fill="white", color = "black"),
      panel.grid.minor=element_blank(),
      panel.grid.major=element_blank()
    ) 
  
  corColors <- RColorBrewer::brewer.pal(n = 7, name = "RdYlGn")[2:6]
  
  if (r <= -0.8) {
    corCol <- corColors[1]
  } else if (r <= -0.6) {
    corCol <- corColors[2]
  } else if (r < 0.6) {
    corCol <- corColors[3]
  } else if (r < 0.8) {
    corCol <- corColors[4]
  } else {
    corCol <- corColors[5]
  }
  p <- p + theme(
    panel.background = element_rect(fill= corCol)
  )
  p
}


ggpairs(
  B1,
  columns = 1:14,
  upper = list(continuous = my_custom_cor_color),
  diag = list(combo = "facethist"),
  lower = list(
    combo = wrap("facethist", bins = 100)
  )
)

Here the correlations are much easier to read than with the default ggpairs settings: with this many variables the plot is hard to interpret without any color coding. Keep in mind that I did not test the correlations for statistical significance.

Some insights from the ggpairs plot:

If we look at the medv variable, which indicates the median value of homes, we can see that it correlates strongly negatively with crim and lstat: the more valuable the homes, the lower the crime rate and the smaller the share of lower-status population. One can also see that age and lstat correlate positively with each other, which suggests that lower social status is connected with older buildings in the neighbourhood. There is also a very strong positive correlation between tax and indus, indicating that high property-tax rates occur in areas with high levels of industrial activity.

# x-mean(x) / sd(x)
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)

# creating a factor variable
scaled_crim <- boston_scaled[, "crim"]
summary(scaled_crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419400 -0.410600 -0.390300  0.000000  0.007389  9.924000
# bins <- quantile(scaled_crim)

crime <- as.numeric(cut_number(scaled_crim,4)) %>%
  factor(labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove old and add new
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

Above I did several things:

  1. scaled all variables in the Boston dataset (note that all variables need to be numeric)
  2. summarized the new scaled dataset
  3. converted the scaled data back to a data frame
  4. cut scaled_crim at its quantiles and created a new factor variable, crime
  5. removed the old numeric variable and replaced it with the categorical one

As we can see, the variables are now all standardized: each value becomes \((x - mean(x))/sd(x)\). Every variable therefore has mean zero, so values below the original mean are negative and values above it are positive.
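The formula can be checked by hand on a small vector (a sketch; `scale()` on a single column returns a one-column matrix, hence the `as.numeric()`):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
by_hand  <- (x - mean(x)) / sd(x)
by_scale <- as.numeric(scale(x))

all.equal(by_hand, by_scale)  # TRUE: scale() computes (x - mean(x)) / sd(x)
```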

Next, I will divide the data into a training set and a test set so that 80% of the data points end up in the training set. I’ll also remove the true values of the crime variable from the test set and store them in a separate variable called correct_classes.

# creating training and test set
set.seed(12)
# sample 80% of the row indices for the training set
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)

train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

correct_classes <- test[, "crime"]
# test set ready for prediction
test <- dplyr::select(test, -crime)

Now we turn to fitting a linear discriminant analysis (LDA) model on the training set. We will also visualize the results.

# fitting the model on the training set
lda.fit <- lda(crime ~. , data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2574257 0.2400990 0.2400990 
## 
## Group means:
##                   zn      indus        chas        nox         rm
## low       0.94307637 -0.9127350 -0.08661679 -0.8744482  0.4412043
## med_low  -0.09454334 -0.3418261  0.03052480 -0.5929899 -0.1265458
## med_high -0.37938430  0.1884615  0.25532354  0.3595668  0.1475993
## high     -0.48724019  1.0172187 -0.06938576  1.0430995 -0.4440326
##                 age        dis        rad        tax     ptratio
## low      -0.9260201  0.8575468 -0.6904206 -0.7365904 -0.45879651
## med_low  -0.3434658  0.3686088 -0.5478832 -0.5261717 -0.09316052
## med_high  0.4063941 -0.3494648 -0.4029017 -0.3114126 -0.30231831
## high      0.8250080 -0.8654942  1.6371072  1.5133254  0.77958792
##                black       lstat        medv
## low       0.37693462 -0.76152029  0.54562143
## med_low   0.31593747 -0.13281609  0.00270585
## med_high  0.07074154 -0.02403283  0.19304349
## high     -0.80338538  0.95065549 -0.70280102
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.07707055  0.737746735 -0.87329135
## indus    0.06895628 -0.349487786  0.25520750
## chas    -0.09673761 -0.051237460  0.09856258
## nox      0.40432895 -0.593237949 -1.47606017
## rm      -0.09136114 -0.087050917 -0.12198778
## age      0.23344441 -0.395445903 -0.04306557
## dis     -0.03332099 -0.326460028  0.06214625
## rad      3.09520048  0.944267182  0.13844589
## tax      0.04398345 -0.022798137  0.35384722
## ptratio  0.08659163  0.084704616 -0.30507815
## black   -0.11029035  0.009633805  0.14947037
## lstat    0.27733959 -0.230885707  0.33364507
## medv     0.23149514 -0.353311333 -0.30490501
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9486 0.0384 0.0129
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

Here we make predictions on the test set with the trained LDA model, and then cross-tabulate the predicted classes against the correct classes.

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12       8        1    0
##   med_low    3      12        7    0
##   med_high   0       9       19    1
##   high       0       0        0   30

As we can see, the LDA model was good at predicting the extreme classes, i.e. high and low crime: only one observation was wrongly predicted as high and three were wrongly predicted as low. The med_low and med_high classes were harder to separate, with more observations off the diagonal.
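The cross-tabulation can be summarized as a single accuracy number by taking the diagonal of the confusion matrix. Here is a sketch with toy factors (with the real objects this would be `table(correct_classes, lda.pred$class)`); applied to the table above, it gives (12 + 12 + 19 + 30) / 102 ≈ 0.72.

```r
correct   <- factor(c("low", "low", "high", "med", "high", "med"))
predicted <- factor(c("low", "med", "high", "med", "high", "low"))

conf <- table(correct, predicted)
# correct predictions sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # 4 correct out of 6
```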

Next, let’s run the k-means clustering algorithm on the dataset. For this we need to calculate distances between observations, and the dataset needs to be scaled so that the distances are comparable. We then determine a suitable number of clusters by looking at the total within-cluster sum of squares (WCSS) for different values of k. After that we visualize the dataset by coloring the observations by their cluster.

data('Boston')
B2 <- scale(Boston)
B2 <- as.data.frame(B2)

dist_eu <- dist(B2, method = "euclidean")
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
dist_man <- dist(B2, method = "manhattan")
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4830 12.6100 13.5500 17.7600 48.8600
# let's determine the optimal number of clusters
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')

We can clearly see that the greatest drop in total WCSS happens at k = 2, so we choose two clusters.

# creating plots to interpret the results
km <- kmeans(dist_eu, centers = 2)

# attach the cluster assignments as a factor
Boston["clusters"] <- as.factor(km$cluster)
ggpairs(Boston, mapping = aes(col = clusters, alpha = 0.3), columns = 1:14, 
              lower = list(combo = wrap("facethist", bins = 100)))

# histograms
library(reshape2)
# drop the binary chas variable; the clusters factor becomes the id variable
d <- melt(Boston[, -c(4)])
ggplot(d, aes(x = value, fill = clusters)) + 
  facet_wrap(~variable, scales = "free_x") + 
  geom_histogram()

It seems that low crime is almost always in the blue cluster and high crime in the red one across all variables. In some dimensions the classes are somewhat mixed, but in others there is a clear division between them.

Next, we can move onto the first BONUS task.

# dist_eu is calculated from the scaled Boston dataset called B2
km <- kmeans(dist_eu, centers = 4)

# attach the cluster memberships as a factor column
B2["clusters"] <- as.factor(km$cluster)

lda.fit2 <- lda(clusters ~., data = B2)
lda.fit2
## Call:
## lda(clusters ~ ., data = B2)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.4308300 0.1304348 0.2272727 0.2114625 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm
## 1 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814
## 2  1.4330759 -0.4872402  1.0689719  0.4435073  1.3439101 -0.7461469
## 3  0.2797949 -0.4872402  1.1892663 -0.2723291  0.8998296 -0.2770011
## 4 -0.3912182  1.2671159 -0.8754697  0.5739635 -0.7359091  0.9938426
##          age        dis        rad        tax     ptratio       black
## 1 -0.3256000  0.3182404 -0.5741127 -0.6240070  0.02986213  0.34248644
## 2  0.8575386 -0.9620552  1.2941816  1.2970210  0.42015742 -1.65562038
## 3  0.7716696 -0.7723199  0.9006160  1.0311612  0.60093343 -0.01717546
## 4 -0.6949417  0.7751031 -0.5965444 -0.6369476 -0.96586616  0.34190729
##        lstat        medv
## 1 -0.2813666 -0.01314324
## 2  1.1930953 -0.81904111
## 3  0.6116223 -0.54636549
## 4 -0.8200275  1.11919598
## 
## Coefficients of linear discriminants:
##                 LD1        LD2         LD3
## crim     0.18113078  0.5012256  0.60535205
## zn       0.43297497  1.0486194 -0.67406151
## indus    1.37753200 -0.3016928 -1.07034034
## chas    -0.04307937  0.7598229  0.22448239
## nox      1.04674638  0.3861005  0.33268952
## rm      -0.14912869  0.1510367 -0.67942589
## age     -0.09897424 -0.0523110 -0.26285587
## dis      0.13139210  0.1593367  0.03487882
## rad      0.65824136 -0.5189795 -0.48145070
## tax      0.28903561  0.5773959 -0.10350513
## ptratio  0.22236843 -0.1668597  0.09181715
## black   -0.42730704 -0.5843973 -0.89869354
## lstat    0.24320629  0.6197780  0.01119242
## medv     0.21961575  0.9485829  0.17065360
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7596 0.1768 0.0636
# target classes as numeric
classes <- as.numeric(B2$clusters)
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit2, myscale = 6)

This looks interesting. The biplot shows which variables pull the observations towards certain clusters: for instance, crim, nox and medv pull observations towards class 2, rad towards class 3 and black towards class 1.

Finally, let’s have a try at the Super-Bonus task.

# super bonus
model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, 
        z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, 
        z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = Boston$clusters[ind])

The two plots show similar groupings of the observations. It seems that high and med_high are clustered together as class 1, and low and med_low as class 2. There are some misclassifications (as seen earlier in the cross tabulation), but this visualization makes it easy to spot them individually.


3. Logistic Regression

Data Wrangling

I started off this week’s exercise with data wrangling. I read the data from UCI Machine Learning Repository.

initialsetup <- function(projectWD){
  setwd(projectWD)
  library(magrittr)
  library(dplyr)
  library(tidyr)
  library(ggplot2)
}
k <- "/Users/veikko/Desktop/datakurssi/IODS-project/data"
initialsetup(k)
getwd()
## [1] "/Users/veikko/Desktop/datakurssi/IODS-project/data"
math <- read.csv("student-mat.csv", header = T, sep = ";")
por <- read.csv("student-por.csv", header = T, sep = ";")

dim(math)
## [1] 395  33
dim(por)
## [1] 649  33
colnames(math)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures"  
## [16] "schoolsup"  "famsup"     "paid"       "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"
head(math)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime failures schoolsup famsup paid
## 1     course   mother          2         2        0       yes     no   no
## 2     course   father          1         2        0        no    yes   no
## 3      other   mother          1         2        3       yes     no  yes
## 4       home   mother          1         3        0        no    yes  yes
## 5       home   father          1         2        0        no    yes  yes
## 6 reputation   mother          1         2        0        no    yes  yes
##   activities nursery higher internet romantic famrel freetime goout Dalc
## 1         no     yes    yes       no       no      4        3     4    1
## 2         no      no    yes      yes       no      5        3     3    1
## 3         no     yes    yes      yes       no      4        3     2    2
## 4        yes     yes    yes      yes      yes      3        2     2    1
## 5         no     yes    yes       no       no      4        3     2    1
## 6        yes     yes    yes      yes       no      5        4     2    1
##   Walc health absences G1 G2 G3
## 1    1      3        6  5  6  6
## 2    1      3        4  5  5  6
## 3    3      3       10  7  8 10
## 4    1      5        2 15 14 15
## 5    2      5        4  6 10 10
## 6    2      5       10 15 15 15

After inspecting the data I continued by joining the math and por datasets on predetermined identifier variables: “school”, “sex”, “age”, “address”, “famsize”, “Pstatus”, “Medu”, “Fedu”, “Mjob”, “Fjob”, “reason”, “nursery” and “internet”.

join_by <- c("school","sex","age","address","famsize","Pstatus","Medu",
             "Fedu","Mjob","Fjob","reason","nursery","internet")

#alternatively: math_por <- inner_join(math, por, by = join_by, suffix = c(".math", ".por"))
math_por <- math %>%
  left_join(por, by = join_by, suffix = c(".math", ".por")) %>%
  drop_na()

glimpse(math_por)
## Observations: 382
## Variables: 53
## $ school          <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G...
## $ sex             <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, ...
## $ age             <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15...
## $ address         <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize         <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, ...
## $ Pstatus         <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, ...
## $ Medu            <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4...
## $ Fedu            <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4...
## $ Mjob            <fctr> at_home, at_home, at_home, health, other, ser...
## $ Fjob            <fctr> teacher, other, other, services, other, other...
## $ reason          <fctr> course, course, other, home, home, reputation...
## $ guardian.math   <fctr> mother, father, mother, mother, father, mothe...
## $ traveltime.math <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1...
## $ studytime.math  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1...
## $ failures.math   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ schoolsup.math  <fctr> yes, no, yes, no, no, no, no, yes, no, no, no...
## $ famsup.math     <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes...
## $ paid.math       <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes,...
## $ activities.math <fctr> no, no, no, yes, no, yes, no, no, no, yes, no...
## $ nursery         <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, y...
## $ higher.math     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ internet        <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes...
## $ romantic.math   <fctr> no, no, no, yes, no, no, no, no, no, no, no, ...
## $ famrel.math     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4...
## $ freetime.math   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4...
## $ goout.math      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4...
## $ Dalc.math       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Walc.math       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2...
## $ health.math     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2...
## $ absences.math   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0,...
## $ G1.math         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14,...
## $ G2.math         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14,...
## $ G3.math         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14...
## $ guardian.por    <fctr> mother, father, mother, mother, father, mothe...
## $ traveltime.por  <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1...
## $ studytime.por   <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1...
## $ failures.por    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ schoolsup.por   <fctr> yes, no, yes, no, no, no, no, yes, no, no, no...
## $ famsup.por      <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes...
## $ paid.por        <fctr> no, no, no, no, no, no, no, no, no, no, no, n...
## $ activities.por  <fctr> no, no, no, yes, no, yes, no, no, no, yes, no...
## $ higher.por      <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic.por    <fctr> no, no, no, yes, no, no, no, no, no, no, no, ...
## $ famrel.por      <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4...
## $ freetime.por    <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4...
## $ goout.por       <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4...
## $ Dalc.por        <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Walc.por        <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2...
## $ health.por      <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2...
## $ absences.por    <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6...
## $ G1.por          <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, ...
## $ G2.por          <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12...
## $ G3.por          <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13...
alc <- math_por %>%
  select(one_of(join_by))

# the columns in the datasets which were not used for joining the data
notjoined_columns <- colnames(math)[!colnames(math) %in% join_by]

Now comes the most demanding part of the data wrangling. From the combined dataset we select the variables that share an original name but carry different suffixes; if a pair of such columns is numeric, we average them row by row with rowMeans, otherwise we keep the first column. I tried to implement this with my own code, but it turned out to be surprisingly tricky (maybe I just did not find the right function — if you have any ideas, feel free to comment). After this procedure I created two new variables, “alc_use” and “high_use”. Finally, I wrote the data into a csv file.

# for every column name not used for joining... 
# copied from Data camp as I could not get my alternative solution for this to work
for(column_name in notjoined_columns) {

  two_columns <- select(math_por, starts_with(column_name))

  first_column <- select(two_columns, 1)[[1]]
  if(is.numeric(first_column)) {
    alc[column_name] <- round(rowMeans(two_columns))
  } else { 
    alc[column_name] <- first_column
  }
}
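For reference, the same averaging can be written without an explicit for loop. This is only a sketch of an alternative (the names alc2 and combined are mine), assuming math_por, join_by and notjoined_columns as created above:

```r
# build each combined column with sapply instead of a for loop
combined <- sapply(notjoined_columns, function(column_name) {
  two_columns  <- select(math_por, starts_with(column_name))
  first_column <- two_columns[[1]]
  if (is.numeric(first_column)) round(rowMeans(two_columns)) else first_column
}, simplify = FALSE)

# bind the identifier columns and the combined columns together
alc2 <- bind_cols(select(math_por, one_of(join_by)), as.data.frame(combined))
```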

glimpse(alc)
## Observations: 382
## Variables: 33
## $ school     <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize    <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus    <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob       <fctr> teacher, other, other, services, other, other, oth...
## $ reason     <fctr> course, course, other, home, home, reputation, hom...
## $ nursery    <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet   <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian   <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup     <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid       <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic   <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel     <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
alc <- alc %>%
  mutate(alc_use = (Walc + Dalc)/2, high_use = alc_use > 2) 

glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize    <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus    <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob       <fctr> teacher, other, other, services, other, other, oth...
## $ reason     <fctr> course, course, other, home, home, reputation, hom...
## $ nursery    <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet   <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian   <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup     <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid       <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic   <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel     <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
write.csv(alc, file = "alcohol_data.csv", row.names = F)

Now that the data looks as instructed (382 observations and 35 variables), we can move on to the analysis part.

Analysis

As a brief recap, our dataset consists of 35 variables and 382 observations; see the previous part for more about the data. Here I quickly illustrate what the “alc_use” variable looks like. Remember that high_use is TRUE when alc_use is greater than 2.

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc)
## [1] 382  35
g1 <- ggplot(data = alc, aes(x = alc_use, fill = sex))
g1 + geom_bar()

I chose these four as my explanatory variables:

  • absences: number of school absences (numeric: from 0 to 93)
  • famrel: quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  • health: current health status (numeric: from 1 - very bad to 5 - very good)
  • goout: going out with friends (numeric: from 1 - very low to 5 - very high)

Absences hypothesis: high absence should indicate high alcohol consumption (people do not go to classes drunk or hungover)

Family relations hypothesis: bad relations (low famrel) could increase the odds of high alcohol consumption

Health hypothesis: low health may cause alcohol consumption, but might also be an effect of it

Going out hypothesis: alcoholism is connected to anti-social behavior at an older age, but for students of this age we should expect drinking to be a social act

Now let’s plot the data for the selected variables. For the sake of clarity I will classify the answers to the variables of interest as “low” or “high”. I will provide boxplots for the grouped variables and scatterplots for the original, unclassified values, using sex to group and color the observations.

repeter <- function(var_name){
  # split a numeric variable into two equal-sized groups labelled low/high
  factor(as.numeric(cut_number(var_name, 2)), labels = c("low", "high"))
}

# change variable name and parameter to create factor levels
alc$goout_group <- repeter(alc$goout)
alc$abs_group <- repeter(alc$absences)
alc$famrel_group <- repeter(alc$famrel)
alc$health_group <- repeter(alc$health)

plotData <- alc %>%
  select(goout_group, abs_group, famrel_group, health_group, sex, alc_use)
plotData2 <- alc %>%
  select(goout, absences, famrel, health, sex, alc_use)

plotData %>%
  gather(-alc_use, -sex, key = "var", value = "value") %>% 
  ggplot(aes(x = value, y = alc_use, color = sex)) +
  geom_boxplot() +
  facet_wrap(~ var, scales = "free") + ylab("Alcohol Consumption") + xlab(" ") + ggtitle("Alcohol Consumption by 4 Variables")

plotData2 %>%
  gather(-alc_use, -sex, key = "var", value = "value") %>% 
  ggplot(aes(x = value, y = alc_use, color = sex)) +
  geom_point(position = "jitter") +
  facet_wrap(~ var, scales = "free") + ylab("Alcohol Consumption") + xlab(" ") + ggtitle("Alcohol Consumption by 4 Variables")

#summarising the data
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = mean(age), mean_absences = mean(absences),
                                              mean_health = mean(health), mean_goout = mean(goout),
                                              mean_famrel = mean(famrel), mean_grade = mean(G3))
## Source: local data frame [4 x 9]
## Groups: sex [?]
## 
##      sex high_use count mean_age mean_absences mean_health mean_goout
##   <fctr>    <lgl> <int>    <dbl>         <dbl>       <dbl>      <dbl>
## 1      F    FALSE   156 16.63462      4.224359    3.378205   2.955128
## 2      F     TRUE    42 16.50000      6.785714    3.404762   3.357143
## 3      M    FALSE   112 16.31250      2.982143    3.714286   2.714286
## 4      M     TRUE    72 16.95833      6.125000    3.875000   3.930556
## # ... with 2 more variables: mean_famrel <dbl>, mean_grade <dbl>

From the plots we can see that gender matters when predicting alcohol usage: men drink more, and this is visible in all plots, whereas for females the different variables seem to have little effect. Regarding our hypotheses, health does not seem to have a strong relation to alcohol consumption, but the other variables do. Poor family relations are associated with men drinking more, a high value of going out is strongly connected to increased alcohol consumption in men, and students who are absent from class also tend to drink more. The summary table shows the corresponding numerical differences between the high_use groups.

Now we can do the logistic regression for high_use variable.

#logistic regression

m <- glm(high_use ~ absences + famrel + goout + health, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + famrel + goout + health, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7888  -0.7668  -0.5299   0.9199   2.3872  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.92507    0.70158  -4.169 3.06e-05 ***
## absences     0.07449    0.02150   3.465 0.000531 ***
## famrel      -0.38409    0.13783  -2.787 0.005326 ** 
## goout        0.78580    0.12050   6.521 6.98e-11 ***
## health       0.17119    0.09200   1.861 0.062775 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 392.24  on 377  degrees of freedom
## AIC: 402.24
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)    absences      famrel       goout      health 
## -2.92506735  0.07449141 -0.38409075  0.78580008  0.17118709
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
cbind(OR, CI)
##                     OR     2.5 %    97.5 %
## (Intercept) 0.05366108 0.0130406 0.2058018
## absences    1.07733608 1.0339972 1.1265093
## famrel      0.68106961 0.5180869 0.8911229
## goout       2.19416174 1.7431850 2.7986922
## health      1.18671275 0.9932560 1.4258286
probabilities <- predict(m, type = "response")
alc <- alc %>%
  mutate(probability = probabilities) %>%
  mutate(prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction =  alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   246   22
##    TRUE     69   45
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point(position = "jitter")

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64397906 0.05759162 0.70157068
##    TRUE  0.18062827 0.11780105 0.29842932
##    Sum   0.82460733 0.17539267 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2382199

Interpreting the output, the 95% confidence interval for the health variable’s odds ratio ranges from 0.9932 to 1.4258. Since this interval contains 1, the underlying coefficient could be either negative (reducing the odds of high_use) or positive, so it is not statistically significant. Family relations have a negative relationship with high_use, while the remaining variables have a positive one.
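To make the odds ratios concrete, the coefficients can be turned into a predicted probability for a single illustrative student; the covariate values below (4 absences, famrel 4, goout 5, health 3) are hypothetical:

```r
b <- coef(m)  # coefficients of the fitted model above
# linear predictor for the hypothetical student
eta <- b["(Intercept)"] + b["absences"] * 4 + b["famrel"] * 4 +
       b["goout"] * 5 + b["health"] * 3
plogis(eta)   # inverse logit; about 0.57 with the coefficients above
```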

After this I computed the cross tabulation, which shows the numbers of correctly and incorrectly classified instances. I then defined the loss function and calculated the prediction error, using the whole data as a training set. The model’s training error is 0.238, so its accuracy is 76.2%. This is clearly better than a random classifier, which would be right only about 50% of the time.
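One caveat about the 50% benchmark: the classes are not balanced, so a classifier that always predicts FALSE would already be right about 70% of the time, which is arguably the more informative baseline. A sketch comparing the two (assumes alc with the prediction column from above):

```r
# training accuracy of the logistic regression model
mean(alc$high_use == alc$prediction)  # about 0.76
# baseline: always predict the majority class (not high_use)
mean(!alc$high_use)                   # about 0.70
```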

Finally, this is the BONUS task. One was supposed to try to find a better model and use 10-fold cross validation.

# health is not statistically significant, so we replace it with sex
m <- glm(high_use ~ absences + famrel + goout + sex, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + famrel + goout + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7151  -0.7820  -0.5137   0.7537   2.5463  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.76826    0.66170  -4.184 2.87e-05 ***
## absences     0.08168    0.02200   3.713 0.000205 ***
## famrel      -0.39378    0.14035  -2.806 0.005020 ** 
## goout        0.76761    0.12316   6.232 4.59e-10 ***
## sexM         1.01234    0.25895   3.909 9.25e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 379.81  on 377  degrees of freedom
## AIC: 389.81
## 
## Number of Fisher Scoring iterations: 4
#cross-validation
library(boot)
cv <- alc %>%
  cv.glm(cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2094241

Here we can see that the cross-validated performance (error about 0.209) is clearly better than that of the DataCamp model, which had an error of about 0.26.
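Because each 10-fold partition is random, cv$delta[1] varies between runs, so a single run can flatter or penalize the model. A sketch that repeats the cross-validation and averages the error estimates (assumes m, alc and loss_func from above):

```r
set.seed(2017)  # reproducible fold assignments
# repeat 10-fold cross-validation 20 times and average the errors
cv_errors <- replicate(20, cv.glm(alc, glmfit = m, cost = loss_func, K = 10)$delta[1])
mean(cv_errors)
```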


2. Regression and Model Validation

Data Wrangling

First, I started with data wrangling exercise. I read the data from http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt.

library(Matrix)
library(ggplot2)
library(dplyr)

setwd("/Users/veikko/Desktop/datakurssi/IODS-project/data")
lrn14 <- read.table("JYTOPKYS3-data.txt", header = T, sep = "\t")
str(lrn14)
## 'data.frame':    183 obs. of  60 variables:
##  $ Aa      : int  3 2 4 4 3 4 4 3 2 3 ...
##  $ Ab      : int  1 2 1 2 2 2 1 1 1 2 ...
##  $ Ac      : int  2 2 1 3 2 1 2 2 2 1 ...
##  $ Ad      : int  1 2 1 2 1 1 2 1 1 1 ...
##  $ Ae      : int  1 1 1 1 2 1 1 1 1 1 ...
##  $ Af      : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ ST01    : int  4 4 3 3 4 4 5 4 4 4 ...
##  $ SU02    : int  2 2 1 3 2 3 2 2 1 2 ...
##  $ D03     : int  4 4 4 4 5 5 4 4 5 4 ...
##  $ ST04    : int  4 4 4 4 3 4 2 5 5 4 ...
##  $ SU05    : int  2 4 2 3 4 3 2 4 2 4 ...
##  $ D06     : int  4 2 3 4 4 5 3 3 4 4 ...
##  $ D07     : int  4 3 4 4 4 5 4 4 5 4 ...
##  $ SU08    : int  3 4 1 2 3 4 4 2 4 2 ...
##  $ ST09    : int  3 4 3 3 4 4 2 4 4 4 ...
##  $ SU10    : int  2 1 1 1 2 1 1 2 1 2 ...
##  $ D11     : int  3 4 4 3 4 5 5 3 4 4 ...
##  $ ST12    : int  3 1 4 3 2 3 2 4 4 4 ...
##  $ SU13    : int  3 3 2 2 3 1 1 2 1 2 ...
##  $ D14     : int  4 2 4 4 4 5 5 4 4 4 ...
##  $ D15     : int  3 3 2 3 3 4 2 2 3 4 ...
##  $ SU16    : int  2 4 3 2 3 2 3 3 4 4 ...
##  $ ST17    : int  3 4 3 3 4 3 4 3 4 4 ...
##  $ SU18    : int  2 2 1 1 1 2 1 2 1 2 ...
##  $ D19     : int  4 3 4 3 4 4 4 4 5 4 ...
##  $ ST20    : int  2 1 3 3 3 3 1 4 4 2 ...
##  $ SU21    : int  3 2 2 3 2 4 1 3 2 4 ...
##  $ D22     : int  3 2 4 3 3 5 4 2 4 4 ...
##  $ D23     : int  2 3 3 3 3 4 3 2 4 4 ...
##  $ SU24    : int  2 4 3 2 4 2 2 4 2 4 ...
##  $ ST25    : int  4 2 4 3 4 4 1 4 4 4 ...
##  $ SU26    : int  4 4 4 2 3 2 1 4 4 4 ...
##  $ D27     : int  4 2 3 3 3 5 4 4 5 4 ...
##  $ ST28    : int  4 2 5 3 5 4 1 4 5 2 ...
##  $ SU29    : int  3 3 2 3 3 2 1 2 1 2 ...
##  $ D30     : int  4 3 4 4 3 5 4 3 4 4 ...
##  $ D31     : int  4 4 3 4 4 5 4 4 5 4 ...
##  $ SU32    : int  3 5 5 3 4 3 4 4 3 4 ...
##  $ Ca      : int  2 4 3 3 2 3 4 2 3 2 ...
##  $ Cb      : int  4 4 5 4 4 5 5 4 5 4 ...
##  $ Cc      : int  3 4 4 4 4 4 4 4 4 4 ...
##  $ Cd      : int  4 5 4 4 3 4 4 5 5 5 ...
##  $ Ce      : int  3 5 3 3 3 3 4 3 3 4 ...
##  $ Cf      : int  2 3 4 4 3 4 5 3 3 4 ...
##  $ Cg      : int  3 2 4 4 4 5 5 3 5 4 ...
##  $ Ch      : int  4 4 2 3 4 4 3 3 5 4 ...
##  $ Da      : int  3 4 1 2 3 3 2 2 4 1 ...
##  $ Db      : int  4 3 4 4 4 5 4 4 2 4 ...
##  $ Dc      : int  4 3 4 5 4 4 4 4 4 4 ...
##  $ Dd      : int  5 4 1 2 4 4 5 3 5 2 ...
##  $ De      : int  4 3 4 5 4 4 5 4 4 2 ...
##  $ Df      : int  2 2 1 1 2 3 1 1 4 1 ...
##  $ Dg      : int  4 3 3 5 5 4 4 4 5 1 ...
##  $ Dh      : int  3 3 1 4 5 3 4 1 4 1 ...
##  $ Di      : int  4 2 1 2 3 3 2 1 4 1 ...
##  $ Dj      : int  4 4 5 5 3 5 4 5 2 4 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
dim(lrn14)
## [1] 183  60

Then I modified the dataset. I followed the instructions and created attitude, deep, stra and surf variables by combining questions in the learning2014 data.

deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30","D06",  "D15", "D23", "D31")
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")

# select the columns related to deep learning and create column 'deep' by averaging
deep_columns <- select(lrn14, one_of(deep_questions))
lrn14$deep <- rowMeans(deep_columns)

# select the columns related to surface learning and create column 'surf' by averaging
surface_columns <- select(lrn14, one_of(surface_questions))
lrn14$surf <- rowMeans(surface_columns)

# select the columns related to strategic learning and create column 'stra' by averaging
strategic_columns <- select(lrn14, one_of(strategic_questions))
lrn14$stra <- rowMeans(strategic_columns)
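As a side note, the three averaging steps share one pattern, so they could also be written as a single loop. Here is a minimal sketch on a toy data frame (the column names and values are made up for illustration):

```r
# toy stand-in data: average groups of question columns into scores
df <- data.frame(D1 = c(4, 2), D2 = c(3, 5), SU1 = c(2, 4), SU2 = c(1, 3))
groups <- list(deep = c("D1", "D2"), surf = c("SU1", "SU2"))
for (nm in names(groups)) {
  df[[nm]] <- rowMeans(df[groups[[nm]]])
}
df$deep  # 3.5 3.5
df$surf  # 1.5 3.5
```

Averaging like this also keeps each combination variable on the original 1-5 answer scale.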

Then I scaled all combination variables to the original scale (taking the mean already does this) and finally excluded observations where the exam points variable was zero. My final list of variables was: gender, age, attitude, deep, stra, surf and points.

colnames(lrn14)[57] <- "age"
colnames(lrn14)[58] <- "attitude"
colnames(lrn14)[59] <- "points"

varsToGet <- c("gender", "age", "attitude", "deep", "stra", "surf", "points")
new_lrn14 <- select(lrn14, one_of(varsToGet))

new_lrn14 <- filter(new_lrn14, points > 0)

After this, I changed my working directory to the IODS project folder and saved the analysis dataset to the folder as a csv-file.

setwd("/Users/veikko/Desktop/datakurssi/IODS-project")
getwd()
## [1] "/Users/veikko/Desktop/datakurssi/IODS-project"
# write.csv always uses a comma separator; a sep argument would be ignored
write.csv(new_lrn14, file = "learning2014.csv")
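As an aside, passing `row.names = FALSE` to `write.csv` avoids writing the row-name index column, so there is nothing extra to drop after reading the file back. A minimal round-trip sketch with a made-up two-row data frame:

```r
# round-trip a small data frame without the row-name index column
df <- data.frame(gender = c("F", "M"), points = c(25, 12))
tmp <- tempfile(fileext = ".csv")
write.csv(df, file = tmp, row.names = FALSE)
back <- read.csv(tmp)
ncol(back)  # 2: no extra "X" column to remove after reading
```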

Then I wanted to check that everything went as planned, so I read the newly created csv file back in.

learning2014 <- read.csv("learning2014.csv", sep = ",", header = TRUE)
# drop the row-name index column that write.csv wrote as the first column
learning2014 <- learning2014[-1]

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22
## 6      F  38       38 4.750000 3.625 2.416667     21
#Almost forgot to scale the attitude variable
learning2014$attitude <- learning2014$attitude / 10
head(learning2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Analysis

Just to briefly repeat, the dataset consists of seven variables: gender, age, attitude, deep, stra, surf and points. In this analysis points is the dependent variable.

  • age: Age (in years) derived from the date of birth
  • gender: Gender of the respondent M (Male), F (Female)
  • points: Exam points
  • attitude: Global attitude toward statistics
  • deep: average of deep learning questions
  • stra: average of strategic learning questions
  • surf: average of surface learning questions

I already provided a summary of the variables in the previous section, showing the minimum, maximum, mean, median and quantile information for each variable. Next, I will present three graphical illustrations of the data. First, an overview with the ggpairs function, with observations grouped by gender. Then a histogram of the points variable. Finally, I will group the data by exam performance into three groups (as close to equal size as possible) and draw another ggpairs plot with that grouping.

library(GGally)

#first ggpairs 
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), 
             lower = list(combo = wrap("facethist", bins = 20)))

# drawing a histogram of points; a factor with the full 7:33 range
# keeps point values with zero count visible as empty bars
f <- factor(learning2014$points, levels = 7:33)
barplot(table(f), xlab = "Points", ylab = "Count")

#splitting the data into 3 groups based on points variable
learning2014$pgroup <- as.numeric(cut_number(learning2014$points,3))
learning2014$pgroup2 <- factor(learning2014$pgroup, labels = c("low", "middle", "high"))
table(learning2014$pgroup2)
## 
##    low middle   high 
##     68     45     53
#drawing a second ggpairs 
test.learn <- subset(learning2014, select = c("age","attitude", "deep", "stra", "surf", "points", "pgroup2"))
ggpairs(test.learn, mapping = aes(col = pgroup2, alpha = 0.3), 
        lower = list(combo = wrap("facethist", bins = 20)))
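The grouping above relies on `cut_number` from ggplot2, which splits a numeric vector into n groups of roughly equal size based on quantiles. A small sketch on made-up point values:

```r
library(ggplot2)  # cut_number() comes from ggplot2

# cut_number splits a numeric vector into n groups of roughly equal size
x <- c(7, 10, 12, 19, 21, 23, 25, 27, 31, 33)
g <- cut_number(x, 3)
table(g)  # three groups of 3-4 values each
```

Converting the result with `as.numeric` and relabelling the factor, as done above, then yields the "low", "middle" and "high" categories.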

We can see many interesting things in the plots above. First, there is a gender-based difference: men score higher on exam points than women, although a statistical test would be needed to determine whether the difference is statistically significant. The first ggpairs matrix also shows the correlation coefficients: attitude has the highest correlation with points, so a higher attitude goes together with higher exam points. Similarly, stra correlates positively with points, whereas surf, deep and age appear to be negatively correlated with it. The correlation coefficients differ between men and women, but for points versus the other variables the differences are modest (within about 0.1 of each other).

The two other plots show how the points are distributed. As an extra, I grouped the data by dividing the points into three categories: "low", "middle" and "high". The second ggpairs matrix shows considerably more variation in the correlation coefficients between these groups than the grouping by gender did.

I chose not to follow the homework instructions strictly at this point, because manually iterating through model selection seemed pointless. One should also remember that in stepwise procedures the order in which variables enter the model matters, so aimless experimentation with different variables is best avoided. Luckily our data is low-dimensional, so best subset selection is feasible. I let best subset selection do the variable selection for me, i.e. present the best model with three variables (as requested in the exercise instructions). Here is how I did it:

library(leaps)

#let's remove those two additional variables
drops <- c("pgroup", "pgroup2")
learning2014_org <- learning2014[ , !(names(learning2014) %in% drops)]

# model selection by best subset selection: best model of each size
regfit.full <- regsubsets(points ~ ., learning2014_org)
reg.summary <- summary(regfit.full)

#some validation metrics
which.max(reg.summary$adjr2)
## [1] 5
which.min(reg.summary$cp)
## [1] 3
which.min(reg.summary$bic)
## [1] 1
# plot Cp against model size and highlight the three-variable model
par(mfrow = c(1, 1))

plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(3, reg.summary$cp[3], col = "red", cex = 2, pch = 20)

# extract the coefficients of the best three-variable model
coef(regfit.full, 3)
## (Intercept)         age    attitude        stra 
## 10.89542805 -0.08821869  3.48076791  1.00371238
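For readers unfamiliar with leaps, the summary object also records which variables enter the best model of each size via its which-matrix. A self-contained sketch on simulated stand-in data (the variable names and coefficients here are made up):

```r
library(leaps)

set.seed(1)
# simulated data: y depends mainly on x1 and x2, x3 is noise
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
d$y <- 2 * d$x1 - d$x2 + rnorm(50, sd = 0.5)

fit <- regsubsets(y ~ ., data = d)
# TRUE marks the variables included in the best model of each size
summary(fit)$which
```

Row 3 of such a matrix is what `coef(regfit.full, 3)` reads off for the real data above.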

Apparently choosing three variables is not a bad idea, as the Cp criterion also recommends a three-variable model. More importantly, we now know the variables of the best three-variable model: attitude, stra and age. Next we fit the model with those variables.

lm.fit <- lm(points ~ attitude + stra + age, data=learning2014_org)
summary(lm.fit)
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014_org)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## age         -0.08822    0.05302  -1.664   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08
par(mfrow = c(2,2))
plot(lm.fit, which = c(1,2,5))

The output shows the model fit and the key information about the model. attitude is the only statistically significant variable; the others have p-values above 0.05. Age has a mild negative coefficient, so it affects points negatively. A one-unit increase in stra is associated with roughly one point more, and a one-unit increase in attitude with almost 3.5 points more. R^2 is 0.2182 and adjusted R^2 is 0.2037, meaning the model explains about one fifth of the variance of the dependent variable. Adjusted R^2 is the better measure for comparing models, since R^2 increases automatically when more variables are added, which adjusted R^2 corrects for. The F-statistic indicates that the coefficients are not all zero.
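To make the coefficient interpretation concrete, here is a back-of-the-envelope prediction using the rounded coefficients from the summary above; the example student is hypothetical:

```r
# rounded coefficients from the model summary above
b <- c(intercept = 10.8954, attitude = 3.4808, stra = 1.0037, age = -0.0882)

# hypothetical student: attitude 4.0, stra 3.0, age 25
pred <- b[["intercept"]] + b[["attitude"]] * 4.0 + b[["stra"]] * 3.0 + b[["age"]] * 25
pred  # about 25.6 exam points
```

With the fitted model object itself, the same number would come from `predict(lm.fit, newdata = data.frame(attitude = 4, stra = 3, age = 25))`.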

We can observe from the regression diagnostic plots that there are no major problems with the residuals (no clear signs of heteroscedasticity), and no individual points appear to exert undue leverage on the model.


1. Tools and methods for open and reproducible research

I am very excited about this Introduction to Open Data Science course, as I feel it offers an opportunity to learn the data science and statistical tools that I have been using in my own projects. I think the DataCamp learning environment was a good choice for the course, and it is very easy to use.

Here is a link to my github repository: IODS course